We normalize taxonomic data to proportions by dividing individual counts by sample-specific sequencing depths. All taxonomic profiles are aggregated up to Genus level with phyloseq::tax_glom(). We first imputed all 0 values with a unit pseudocount and then transform the data using center log-ratio transformation implemented in the function clr from the package compositions in R. For untargeted metabolomic data, we transform the data into relative abundances similar to our taxonomic profiles, and then subsequently use the arcsine square root transform to ensure approximate homoscedasticity when applying linear models. For targeted metabolomic data sets, since the data is already in exact concentration format, we simply perform a log (x + 1) transformation, accounting for 0 values. Models are fitted to the transformed data and the resulting predictions are back-transformed to preserve the coverage of predicted metabolite compositions.
metaMDS from the package vegan.GUnifrac from the MiSPU package was used to construct the distance matrix.vegan), however, previous studies have shown that Procrustes analyses are more robust (Jackson 1995).Figure 1. NMDS ordination of samples according to taxonomy (blue) and metabolite (red) profiles for both targeted (left) and untargeted (right) methods in 12M samples. Metabolite ordinations were rotated according to the procrustes procedure to enable testing for concordance between datasets using permutations.
For 12M samples we can see that only targeted metabolite profiles are significantly associated with taxonomic profiles according to the procrustes test(p = 0.001, SS = 0.9602, nperm = 999). However, Mantel tests shows that both metabolite profiles are significantly associated (Untargeted: p = 0.007, R = 0.09111; Targeted: p = 0.046, R = 0.06027)
Figure 2. NMDS ordination of samples according to taxonomy (blue) and metabolite (red) profiles for both targeted (left) and untargeted (right) methods in 6W samples. Metabolite ordinations were rotated according to the procrustes procedure to enable testing for concordance between datasets using permutations.
For 6W samples both types of profiling does not result in significant correlation with taxonomic abundances according to procrustes test. However, mantel tests again show that they are significantly correlated (Untargeted: p = 0.001, R = 0.1487; Targeted: p = 0.03, R = 0.1181).
We fit our machine learning models using the caret package in R. Scaling and centering was performed on the predictor matrix prior to model fitting. We evaluate our models using 10-fold nested cross validation, where within each outer training set we nest a cross-validation procedure for parameter tuning which should prevent overfitting.
Models we compared include elastic net (EN), random forest (RF), support vector machines with radial basis function kernel (SVM), sparse partial least squares (SPLS) as well as multivariate extensions of EN, SPLS and RF. These models were chosen based on prelminiary fit, as well as from existing literature (Pasolli et al. 2016, @borchani2015).
We evaluate our models using four different criterion: Spearman’s rho, Root mean squared error (RMSE), Root relative squared error (RRSE) and Predictive R-squared (R2). In order to contextualize our values, we simulated null distributions of our test statistic by permuting both the predictor and outcome matrices using the function randomizeMatrix in the package picante. The procedure performs permutations while perserving the general structure of the dataset. So far, we’ve only performed null simulations for concentration-fitted data of certain metabolites.
Figure 1 demonstrates the overall RMSE and R-squared scores averaged across metabolites for each model tested. Results show that Random Forest models are much better at prediction than all other models according to both metrics. Therefore, we choose to focus on random forest models as our main prediction tool.
Figure 3. RMSE and R-squared scores averaged across folds then across all metabolites for each model. Models were trained and tested on data at 12M
Figure 4. 10% Trimmed mean values across all cross validation folds of evaluation metrics for each targeted metabolite in the 6W dataset. Red lines indicate median values for that evaluation metric across 1000 null permutations
Figure 5. 10% Trimmed mean values across all cross validation folds of evaluation metrics for untargeted metabolite bins in the 6W dataset
Figure 6. 10% Trimmed mean values across all cross validation folds of evaluation metrics for each targeted metabolite in the 12M dataset. Red lines indicate median values for that evaluation metric across 1000 null permutations
Figure 7. 10% Trimmed mean values across all cross validation folds of evaluation metrics for untargeted metabolite bins in the 12M dataset. Red lines indicate median values for that evaluation metric across 1000 null permutations
Figure 8. Boxplots of each evaluation metric across all metabolites (averaged across all folds). RMSE values were not included due to difference in scale
We chose a cut off of \(\rho = 0.3\) based on a recent study (Mallick et al. 2019) to identify well predicted metabolites. According to that criterion, for 12M samples, around 38.8% of targeted metabolites and 27.4% of untargeted metabolites fit while for 6W samples that number is 69.4% for targeted metabolites and 60% for untargeted metabolites. However when we look at the top performing metabolite for each condition, we have:
| Time | Metabolite Type | Metabolite Id | Evaluation Metric | Value |
|---|---|---|---|---|
| 12M | Targeted | M27 | RMSE | 1633.8803437 |
| 12M | Targeted | M27 | RRSE | 0.9377081 |
| 12M | Targeted | M27 | Correlation | 0.4889715 |
| 12M | Targeted | M27 | R2 | 0.1164403 |
| 12M | Untargeted | M32 | RMSE | 0.0054377 |
| 12M | Untargeted | M32 | RRSE | 0.9075718 |
| 12M | Untargeted | M32 | Correlation | 0.5109847 |
| 12M | Untargeted | M32 | R2 | 0.1623476 |
| 6W | Targeted | M5 | RMSE | 624.3625565 |
| 6W | Targeted | M5 | RRSE | 0.9214036 |
| 6W | Targeted | M5 | Correlation | 0.6538911 |
| 6W | Targeted | M5 | R2 | 0.1450382 |
| 6W | Untargeted | M49 | RMSE | 0.0036279 |
| 6W | Untargeted | M49 | RRSE | 0.9859128 |
| 6W | Untargeted | M49 | Correlation | 0.6821719 |
| 6W | Untargeted | M49 | R2 | -0.0294060 |
Figure 9. Relative abundances of bacterial families (top panel) and targeted metabolites (bottom panel) for 6W samples
Figure 10. Relative abundances of bacterial families (top panel) and targeted metabolites (bottom panel) for 12M samples
Figure 11. Sample distance according to metabolite profiles (both targeted (left) and untargeted (right)) vs according to taxonomic profiles at 6W. Gunifrac (alpha = 0.5) was used to construct sample distance matrix from taxonomic relative abundances while Manhattan distance was used to construct sample distance matrix from principal components of metabolite concentrations.
Figure 12. Sample distance according to metabolite profiles (both targeted (left) and untargeted (right)) vs according to taxonomic profiles at 12M. Gunifrac (alpha = 0.5) was used to construct sample distance matrix from taxonomic relative abundances while Manhattan distance was used to construct sample distance matrix from principal components of metabolite concentrations.
Borchani, Hanen, Gherardo Varando, Concha Bielza, and Pedro Larrañaga. 2015. “A Survey on Multi-Output Regression.” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5 (5): 216–33. https://doi.org/10.1002/widm.1157.
Frost, H. Robert, Zhigang Li, and Jason H. Moore. 2015. “Spectral Gene Set Enrichment (SGSE).” BMC Bioinformatics 16 (1): 70. https://doi.org/10.1186/s12859-015-0490-7.
Jackson, Donald A. 1995. “PROTEST: A PROcrustean Randomization TEST of Community Environment Concordance.” Écoscience 2 (3): 297–303. https://doi.org/10.1080/11956860.1995.11682297.
Mallick, Himel, Eric A. Franzosa, Lauren J. Mclver, Soumya Banerjee, Alexandra Sirota-Madi, Aleksandar D. Kostic, Clary B. Clish, Hera Vlamakis, Ramnik J. Xavier, and Curtis Huttenhower. 2019. “Predictive Metabolomic Profiling of Microbial Communities Using Amplicon or Metagenomic Sequences.” Nature Communications 10 (1): 1–11. https://doi.org/10.1038/s41467-019-10927-1.
Pasolli, Edoardo, Duy Tin Truong, Faizan Malik, Levi Waldron, and Nicola Segata. 2016. “Machine Learning Meta-Analysis of Large Metagenomic Datasets: Tools and Biological Insights.” PLOS Computational Biology 12 (7): e1004977. https://doi.org/10.1371/journal.pcbi.1004977.